{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and store data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains information on downloading the Quandl Wiki stock prices and a few other sources that we use throughout the book. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:32:07.092623Z",
     "start_time": "2020-06-18T14:32:07.090885Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:32:07.263130Z",
     "start_time": "2020-06-18T14:32:07.259861Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import requests\n",
    "from io import BytesIO\n",
    "from zipfile import ZipFile, BadZipFile\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pandas_datareader.data as web\n",
    "from sklearn.datasets import fetch_openml\n",
    "\n",
    "pd.set_option('display.expand_frame_repr', False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Data Store path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modify path if you would like to store the data elsewhere and change the notebooks accordingly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:27:54.832609Z",
     "start_time": "2020-06-19T02:27:54.824778Z"
    }
   },
   "outputs": [],
   "source": [
    "DATA_STORE = Path('assets.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quandl Wiki Prices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Quandl](https://www.quandl.com/) makes available a [dataset](https://www.quandl.com/databases/WIKIP/documentation) with stock prices, dividends and splits for 3000 US publicly-traded companies. Quandl decided to discontinue support in favor of its commercial offerings but the historical data are still useful to demonstrate the application of the machine learning solutions in the book, just ensure you implement your own algorithms on current data.\n",
    "\n",
    "> As of April 11, 2018 this data feed is no longer actively supported by the Quandl community. We will continue to host this data feed on Quandl, but we do not recommend using it for investment or analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account\n",
    "2. [Download](https://www.quandl.com/databases/WIKIP/usage/export) the entire WIKI/PRICES data\n",
    "3. Extract the .zip file,\n",
    "4. Move to this directory and rename to wiki_prices.csv\n",
    "5. Run the below code to store in fast HDF format (see [Chapter 02 on Market & Fundamental Data](../02_market_and_fundamental_data) for details)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-16T21:38:46.759327Z",
     "start_time": "2020-06-16T21:37:48.398856Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 15389314 entries, (Timestamp('1962-01-02 00:00:00'), 'ARNC') to (Timestamp('2018-03-27 00:00:00'), 'ZUMZ')\n",
      "Data columns (total 12 columns):\n",
      " #   Column       Non-Null Count     Dtype  \n",
      "---  ------       --------------     -----  \n",
      " 0   open         15388776 non-null  float64\n",
      " 1   high         15389259 non-null  float64\n",
      " 2   low          15389259 non-null  float64\n",
      " 3   close        15389313 non-null  float64\n",
      " 4   volume       15389314 non-null  float64\n",
      " 5   ex-dividend  15389314 non-null  float64\n",
      " 6   split_ratio  15389313 non-null  float64\n",
      " 7   adj_open     15388776 non-null  float64\n",
      " 8   adj_high     15389259 non-null  float64\n",
      " 9   adj_low      15389259 non-null  float64\n",
      " 10  adj_close    15389313 non-null  float64\n",
      " 11  adj_volume   15389314 non-null  float64\n",
      "dtypes: float64(12)\n",
      "memory usage: 1.4+ GB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "df = (pd.read_csv('wiki_prices.csv',\n",
    "                 parse_dates=['date'],\n",
    "                 index_col=['date', 'ticker'],\n",
    "                 infer_datetime_format=True)\n",
    "     .sort_index())\n",
    "\n",
    "print(df.info(null_counts=True))\n",
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('quandl/wiki/prices', df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wiki Prices Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As of writing, the following instructions no longer work because Quandl changed its API:\n",
    "\n",
    "> 1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account if you haven't done so yet\n",
    "> 2. Find link to download wiki metadata under Companies](https://www.quandl.com/databases/WIKIP/documentation) or use the download link with your API_KEY: https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=<API_KEY>\n",
    "> 3. Extract the .zip file,\n",
    "> 4. Move to this directory and rename to wiki_stocks.csv\n",
    "> 5. Run the following code to store in fast HDF format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead, load the file `wiki_stocks.csv` as described and store in HDF5 format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3199 entries, 0 to 3198\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   code    3199 non-null   object\n",
      " 1   name    3199 non-null   object\n",
      "dtypes: object(2)\n",
      "memory usage: 50.1+ KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('wiki_stocks.csv')\n",
    "# no longer needed\n",
    "# df = pd.concat([df.loc[:, 'code'].str.strip(),\n",
    "#                 df.loc[:, 'name'].str.split('(', expand=True)[0].str.strip().to_frame('name')], axis=1)\n",
    "\n",
    "print(df.info(null_counts=True))\n",
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('quandl/wiki/stocks', df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## S&P 500 Prices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code downloads historical S&P 500 prices from FRED (only last 10 years of daily data is freely available)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "DatetimeIndex: 2609 entries, 2010-06-16 to 2020-06-15\n",
      "Data columns (total 1 columns):\n",
      " #   Column  Non-Null Count  Dtype  \n",
      "---  ------  --------------  -----  \n",
      " 0   close   2517 non-null   float64\n",
      "dtypes: float64(1)\n",
      "memory usage: 40.8 KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "df = web.DataReader(name='SP500', data_source='fred', start=2009).squeeze().to_frame('close')\n",
    "print(df.info())\n",
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('sp500/fred', df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, download S&P500 data from [stooq.com](https://stooq.com/q/?s=%5Espx&c=1d&t=l&a=lg&b=0); at the time of writing the data was available since 1789. You can switch from Polish to English on the lower right-hand side.\n",
    "\n",
    "We store the data from 1950-2020:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-16T21:42:08.524471Z",
     "start_time": "2020-06-16T21:42:08.415664Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "DatetimeIndex: 17700 entries, 1950-01-03 to 2019-12-31\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count  Dtype  \n",
      "---  ------  --------------  -----  \n",
      " 0   open    17700 non-null  float64\n",
      " 1   high    17700 non-null  float64\n",
      " 2   low     17700 non-null  float64\n",
      " 3   close   17700 non-null  float64\n",
      " 4   volume  17700 non-null  float64\n",
      "dtypes: float64(5)\n",
      "memory usage: 829.7 KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "sp500_stooq = (pd.read_csv('^spx_d.csv', index_col=0,\n",
    "                     parse_dates=True).loc['1950':'2019'].rename(columns=str.lower))\n",
    "print(sp500_stooq.info())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-16T21:42:09.567451Z",
     "start_time": "2020-06-16T21:42:09.556832Z"
    }
   },
   "outputs": [],
   "source": [
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('sp500/stooq', sp500_stooq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### S&P 500 Constituents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code downloads the current S&P 500 constituents from [Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-17T10:50:19.998830Z",
     "start_time": "2020-06-17T10:50:18.784120Z"
    }
   },
   "outputs": [],
   "source": [
    "url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n",
    "df = pd.read_html(url, header=0)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-17T10:50:21.745331Z",
     "start_time": "2020-06-17T10:50:21.724637Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Symbol</th>\n",
       "      <th>Security</th>\n",
       "      <th>SEC filings</th>\n",
       "      <th>GICS Sector</th>\n",
       "      <th>GICS Sub Industry</th>\n",
       "      <th>Headquarters Location</th>\n",
       "      <th>Date first added</th>\n",
       "      <th>CIK</th>\n",
       "      <th>Founded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>MMM</td>\n",
       "      <td>3M Company</td>\n",
       "      <td>reports</td>\n",
       "      <td>Industrials</td>\n",
       "      <td>Industrial Conglomerates</td>\n",
       "      <td>St. Paul, Minnesota</td>\n",
       "      <td>1976-08-09</td>\n",
       "      <td>66740</td>\n",
       "      <td>1902</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ABT</td>\n",
       "      <td>Abbott Laboratories</td>\n",
       "      <td>reports</td>\n",
       "      <td>Health Care</td>\n",
       "      <td>Health Care Equipment</td>\n",
       "      <td>North Chicago, Illinois</td>\n",
       "      <td>1964-03-31</td>\n",
       "      <td>1800</td>\n",
       "      <td>1888</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ABBV</td>\n",
       "      <td>AbbVie Inc.</td>\n",
       "      <td>reports</td>\n",
       "      <td>Health Care</td>\n",
       "      <td>Pharmaceuticals</td>\n",
       "      <td>North Chicago, Illinois</td>\n",
       "      <td>2012-12-31</td>\n",
       "      <td>1551152</td>\n",
       "      <td>2013 (1888)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ABMD</td>\n",
       "      <td>ABIOMED Inc</td>\n",
       "      <td>reports</td>\n",
       "      <td>Health Care</td>\n",
       "      <td>Health Care Equipment</td>\n",
       "      <td>Danvers, Massachusetts</td>\n",
       "      <td>2018-05-31</td>\n",
       "      <td>815094</td>\n",
       "      <td>1981</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ACN</td>\n",
       "      <td>Accenture plc</td>\n",
       "      <td>reports</td>\n",
       "      <td>Information Technology</td>\n",
       "      <td>IT Consulting &amp; Other Services</td>\n",
       "      <td>Dublin, Ireland</td>\n",
       "      <td>2011-07-06</td>\n",
       "      <td>1467373</td>\n",
       "      <td>1989</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Symbol             Security SEC filings             GICS Sector               GICS Sub Industry    Headquarters Location Date first added      CIK      Founded\n",
       "0    MMM           3M Company     reports             Industrials        Industrial Conglomerates      St. Paul, Minnesota       1976-08-09    66740         1902\n",
       "1    ABT  Abbott Laboratories     reports             Health Care           Health Care Equipment  North Chicago, Illinois       1964-03-31     1800         1888\n",
       "2   ABBV          AbbVie Inc.     reports             Health Care                 Pharmaceuticals  North Chicago, Illinois       2012-12-31  1551152  2013 (1888)\n",
       "3   ABMD          ABIOMED Inc     reports             Health Care           Health Care Equipment   Danvers, Massachusetts       2018-05-31   815094         1981\n",
       "4    ACN        Accenture plc     reports  Information Technology  IT Consulting & Other Services          Dublin, Ireland       2011-07-06  1467373         1989"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-17T10:51:06.628859Z",
     "start_time": "2020-06-17T10:51:06.621125Z"
    }
   },
   "outputs": [],
   "source": [
    "df.columns = ['ticker', 'name', 'sec_filings', 'gics_sector', 'gics_sub_industry',\n",
    "              'location', 'first_added', 'cik', 'founded']\n",
    "df = df.drop('sec_filings', axis=1).set_index('ticker')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-17T10:51:14.523579Z",
     "start_time": "2020-06-17T10:51:14.515004Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 505 entries, MMM to ZTS\n",
      "Data columns (total 7 columns):\n",
      " #   Column             Non-Null Count  Dtype \n",
      "---  ------             --------------  ----- \n",
      " 0   name               505 non-null    object\n",
      " 1   gics_sector        505 non-null    object\n",
      " 2   gics_sub_industry  505 non-null    object\n",
      " 3   location           505 non-null    object\n",
      " 4   first_added        408 non-null    object\n",
      " 5   cik                505 non-null    int64 \n",
      " 6   founded            234 non-null    object\n",
      "dtypes: int64(1), object(6)\n",
      "memory usage: 31.6+ KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "print(df.info())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-17T10:51:17.814771Z",
     "start_time": "2020-06-17T10:51:17.788608Z"
    }
   },
   "outputs": [],
   "source": [
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('sp500/stocks', df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metadata on US-traded companies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following downloads several attributes for [companies](https://www.nasdaq.com/screening/companies-by-name.aspx) traded on NASDAQ, AMEX and NYSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6988 entries, TXG to ZYME\n",
      "Data columns (total 6 columns):\n",
      " #   Column     Non-Null Count  Dtype  \n",
      "---  ------     --------------  -----  \n",
      " 0   name       6988 non-null   object \n",
      " 1   lastsale   6815 non-null   float64\n",
      " 2   marketcap  5383 non-null   object \n",
      " 3   ipoyear    3228 non-null   float64\n",
      " 4   sector     5323 non-null   object \n",
      " 5   industry   5323 non-null   object \n",
      "dtypes: float64(2), object(4)\n",
      "memory usage: 382.2+ KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "url = 'https://old.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange={}&render=download'\n",
    "exchanges = ['NASDAQ', 'AMEX', 'NYSE']\n",
    "df = pd.concat([pd.read_csv(url.format(ex)) for ex in exchanges]).dropna(how='all', axis=1)\n",
    "df = df.rename(columns=str.lower).set_index('symbol').drop('summary quote', axis=1)\n",
    "df = df[~df.index.duplicated()]\n",
    "print(df.info()) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>lastsale</th>\n",
       "      <th>marketcap</th>\n",
       "      <th>ipoyear</th>\n",
       "      <th>sector</th>\n",
       "      <th>industry</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>symbol</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>TXG</th>\n",
       "      <td>10x Genomics, Inc.</td>\n",
       "      <td>88.4200</td>\n",
       "      <td>$8.7B</td>\n",
       "      <td>2019.0</td>\n",
       "      <td>Capital Goods</td>\n",
       "      <td>Biotechnology: Laboratory Analytical Instruments</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>YI</th>\n",
       "      <td>111, Inc.</td>\n",
       "      <td>6.6200</td>\n",
       "      <td>$545.22M</td>\n",
       "      <td>2018.0</td>\n",
       "      <td>Health Care</td>\n",
       "      <td>Medical/Nursing Services</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PIH</th>\n",
       "      <td>1347 Property Insurance Holdings, Inc.</td>\n",
       "      <td>4.5443</td>\n",
       "      <td>$27.58M</td>\n",
       "      <td>2014.0</td>\n",
       "      <td>Finance</td>\n",
       "      <td>Property-Casualty Insurers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PIHPP</th>\n",
       "      <td>1347 Property Insurance Holdings, Inc.</td>\n",
       "      <td>25.4202</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Finance</td>\n",
       "      <td>Property-Casualty Insurers</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>TURN</th>\n",
       "      <td>180 Degree Capital Corp.</td>\n",
       "      <td>1.8300</td>\n",
       "      <td>$56.95M</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Finance</td>\n",
       "      <td>Finance/Investors Services</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          name  lastsale marketcap  ipoyear         sector                                          industry\n",
       "symbol                                                                                                                                      \n",
       "TXG                         10x Genomics, Inc.   88.4200     $8.7B   2019.0  Capital Goods  Biotechnology: Laboratory Analytical Instruments\n",
       "YI                                   111, Inc.    6.6200  $545.22M   2018.0    Health Care                          Medical/Nursing Services\n",
       "PIH     1347 Property Insurance Holdings, Inc.    4.5443   $27.58M   2014.0        Finance                        Property-Casualty Insurers\n",
       "PIHPP   1347 Property Insurance Holdings, Inc.   25.4202       NaN      NaN        Finance                        Property-Casualty Insurers\n",
       "TURN                  180 Degree Capital Corp.    1.8300   $56.95M      NaN        Finance                        Finance/Investors Services"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert market cap information to numerical format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Market cap is provided as strings so we need to convert it to numerical format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "M    3148\n",
       "B    2235\n",
       "Name: suffix, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mcap = df[['marketcap']].dropna()\n",
    "mcap['suffix'] = mcap.marketcap.str[-1]\n",
    "mcap.suffix.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep only values with value units:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 5383 entries, TXG to ZYME\n",
      "Data columns (total 2 columns):\n",
      " #   Column     Non-Null Count  Dtype  \n",
      "---  ------     --------------  -----  \n",
      " 0   marketcap  5383 non-null   float64\n",
      " 1   suffix     5383 non-null   object \n",
      "dtypes: float64(1), object(1)\n",
      "memory usage: 286.2+ KB\n"
     ]
    }
   ],
   "source": [
    "mcap = mcap[mcap.suffix.str.endswith(('B', 'M'))]\n",
    "mcap.marketcap = pd.to_numeric(mcap.marketcap.str[1:-1])\n",
    "mcaps = {'M': 1e6, 'B': 1e9}\n",
    "for symbol, factor in mcaps.items():\n",
    "    mcap.loc[mcap.suffix == symbol, 'marketcap'] *= factor\n",
    "mcap.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count                5,383\n",
       "mean         8,058,312,556\n",
       "std         46,063,490,648\n",
       "min              1,680,000\n",
       "10%             41,436,000\n",
       "20%            104,184,000\n",
       "30%            192,888,000\n",
       "40%            335,156,000\n",
       "50%            587,760,000\n",
       "60%          1,120,000,000\n",
       "70%          2,140,000,000\n",
       "80%          4,480,000,000\n",
       "90%         13,602,000,000\n",
       "max      1,486,630,000,000\n",
       "Name: marketcap, dtype: object"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['marketcap'] = mcap.marketcap\n",
    "df.marketcap.describe(percentiles=np.arange(.1, 1, .1).round(1)).apply(lambda x: f'{int(x):,d}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Store result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file `us_equities_meta_data.csv` contains a version of the data used for many of the examples. Load using \n",
    "```\n",
    "df = pd.read_csv('us_equities_meta_data.csv')\n",
    "```\n",
    "and proceed to store in HDF5 format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-16T21:35:00.910722Z",
     "start_time": "2020-06-16T21:35:00.862761Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 6834 entries, 0 to 6833\n",
      "Data columns (total 7 columns):\n",
      " #   Column     Non-Null Count  Dtype  \n",
      "---  ------     --------------  -----  \n",
      " 0   ticker     6834 non-null   object \n",
      " 1   name       6834 non-null   object \n",
      " 2   lastsale   6718 non-null   float64\n",
      " 3   marketcap  5766 non-null   float64\n",
      " 4   ipoyear    3038 non-null   float64\n",
      " 5   sector     5288 non-null   object \n",
      " 6   industry   5288 non-null   object \n",
      "dtypes: float64(3), object(4)\n",
      "memory usage: 373.9+ KB\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('us_equities_meta_data.csv')\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-16T21:35:22.388412Z",
     "start_time": "2020-06-16T21:35:22.356828Z"
    }
   },
   "outputs": [],
   "source": [
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('us_equities/stocks', df.set_index('ticker'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MNIST Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T17:19:11.720146Z",
     "start_time": "2020-06-18T17:18:53.948739Z"
    }
   },
   "outputs": [],
   "source": [
    "mnist = fetch_openml('mnist_784', version=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T17:19:11.723222Z",
     "start_time": "2020-06-18T17:19:11.721079Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  \n",
      "**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  \n",
      "**Please cite**:  \n",
      "\n",
      "The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  \n",
      "\n",
      "It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.  \n",
      "\n",
      "With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.  \n",
      "\n",
      "The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n",
      "\n",
      "Downloaded from openml.org.\n"
     ]
    }
   ],
   "source": [
    "print(mnist.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T17:19:21.823457Z",
     "start_time": "2020-06-18T17:19:21.815369Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mnist.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T17:19:59.989715Z",
     "start_time": "2020-06-18T17:19:59.983320Z"
    }
   },
   "outputs": [],
   "source": [
    "mnist_path = Path('mnist')\n",
    "if not mnist_path.exists():\n",
    "    mnist_path.mkdir()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T17:20:25.489271Z",
     "start_time": "2020-06-18T17:20:25.418621Z"
    }
   },
   "outputs": [],
   "source": [
    "np.save(mnist_path / 'data', mnist.data.astype(np.uint8))\n",
    "np.save(mnist_path / 'labels', mnist.target.astype(np.uint8))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fashion MNIST Image Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the Fashion MNIST image data created by [Zalando Research](https://github.com/zalandoresearch/fashion-mnist) for some demonstrations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:42:51.598398Z",
     "start_time": "2020-06-18T14:42:38.233167Z"
    }
   },
   "outputs": [],
   "source": [
    "fashion_mnist = fetch_openml(name='Fashion-MNIST')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:42:51.604464Z",
     "start_time": "2020-06-18T14:42:51.599373Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Author**: Han Xiao, Kashif Rasul, Roland Vollgraf  \n",
      "**Source**: [Zalando Research](https://github.com/zalandoresearch/fashion-mnist)  \n",
      "**Please cite**: Han Xiao and Kashif Rasul and Roland Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, arXiv, cs.LG/1708.07747  \n",
      "\n",
      "Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. \n",
      "\n",
      "Raw data available at: https://github.com/zalandoresearch/fashion-mnist\n",
      "\n",
      "### Target classes\n",
      "Each training and test example is assigned to one of the following labels:\n",
      "Label  Description  \n",
      "0  T-shirt/top  \n",
      "1  Trouser  \n",
      "2  Pullover  \n",
      "3  Dress  \n",
      "4  Coat  \n",
      "5  Sandal  \n",
      "6  Shirt  \n",
      "7  Sneaker  \n",
      "8  Bag  \n",
      "9  Ankle boot\n",
      "\n",
      "Downloaded from openml.org.\n"
     ]
    }
   ],
   "source": [
    "print(fashion_mnist.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T15:14:10.969125Z",
     "start_time": "2020-06-18T15:14:10.960466Z"
    }
   },
   "outputs": [],
   "source": [
    "label_dict = {0: 'T-shirt/top',\n",
    "              1: 'Trouser',\n",
    "              2: 'Pullover',\n",
    "              3: 'Dress',\n",
    "              4: 'Coat',\n",
    "              5: 'Sandal',\n",
    "              6: 'Shirt',\n",
    "              7: 'Sneaker',\n",
    "              8: 'Bag',\n",
    "              9: 'Ankle boot'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T15:14:11.293906Z",
     "start_time": "2020-06-18T15:14:11.290279Z"
    }
   },
   "outputs": [],
   "source": [
    "fashion_path = Path('fashion_mnist')\n",
    "if not fashion_path.exists():\n",
    "    fashion_path.mkdir()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T15:14:11.941092Z",
     "start_time": "2020-06-18T15:14:11.926416Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.Series(label_dict).to_csv(fashion_path / 'label_dict.csv', index=False, header=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:45:50.376114Z",
     "start_time": "2020-06-18T14:45:50.301028Z"
    }
   },
   "outputs": [],
   "source": [
    "np.save(fashion_path / 'data', fashion_mnist.data.astype(np.uint8))\n",
    "np.save(fashion_path / 'labels', fashion_mnist.target.astype(np.uint8))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bond Price Indexes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code downloads several bond indexes from the Federal Reserve Economic Data service ([FRED](https://fred.stlouisfed.org/))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "securities = {'BAMLCC0A0CMTRIV'   : 'US Corp Master TRI',\n",
    "              'BAMLHYH0A0HYM2TRIV': 'US High Yield TRI',\n",
    "              'BAMLEMCBPITRIV'    : 'Emerging Markets Corporate Plus TRI',\n",
    "              'GOLDAMGBD228NLBM'  : 'Gold (London, USD)',\n",
    "              'DGS10'             : '10-Year Treasury CMR',\n",
    "              }\n",
    "\n",
    "df = web.DataReader(name=list(securities.keys()), data_source='fred', start=2000)\n",
    "df = df.rename(columns=securities).dropna(how='all').resample('B').mean()\n",
    "\n",
    "with pd.HDFStore(DATA_STORE) as store:\n",
    "    store.put('fred/assets', df)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
